Skip to content

chore: Refactor task schedulers to use context.Context for lifecycle management#7746

Merged
Shaddoll merged 1 commit intocadence-workflow:masterfrom
Shaddoll:scheduler
Feb 25, 2026
Merged

chore: Refactor task schedulers to use context.Context for lifecycle management#7746
Shaddoll merged 1 commit intocadence-workflow:masterfrom
Shaddoll:scheduler

Conversation

@Shaddoll
Copy link
Member

@Shaddoll Shaddoll commented Feb 25, 2026

What changed?
Refactor task schedulers to use context.Context instead of a channel for lifecycle management.

Why?
context.Context is the better default for lifecycle management. It supports propagation, deadline/timeout and is the standard convention in Go.

How did you test it?
unit tests
cd common/task && go test ./...

Potential risks
N/A

Release notes
N/A

Documentation Changes
N/A


Reviewer Validation

PR Description Quality (check these before reviewing code):

  • "What changed" provides a clear 1-2 line summary
    • Project Issue is linked
  • "Why" explains the full motivation with sufficient context
  • Testing is documented:
    • Unit test commands are included (with exact go test invocation)
    • Integration test setup/commands included (if integration tests were run)
    • Canary testing details included (if canary was mentioned)
  • Potential risks section is thoughtfully filled out (or legitimately N/A)
  • Release notes included if this completes a user-facing feature
  • Documentation needs are addressed (or noted if uncertain)

@gitar-bot
Copy link

gitar-bot bot commented Feb 25, 2026

🔍 CI failure analysis for ec19356: Two activity heartbeat integration tests are consistently failing across multiple CI runs: TestActivityHeartbeatTimeouts and TestActivityHeartbeatDetailsDuringRetry. The failures reproduce consistently, indicating a regression related to the task scheduler refactoring.

Issue

Two activity heartbeat-related integration tests are consistently failing across multiple CI runs:

  1. TestActivityHeartbeatTimeouts - Expected activities to timeout but they completed successfully instead

    • Error at activity_test.go:1107: "Expected activities to timeout but seeing completion instead"
    • Test duration: ~18-19s (consistent across runs)
    • Job IDs: 64898331275, 64902727411
  2. TestActivityHeartbeatDetailsDuringRetry - Wrong retry count observed

    • Error at activity_test.go:317: Expected retry count of 2 but got 1
    • Test duration: ~61s (consistent across runs)
    • Job IDs: 64898331275, 64902727411

Root Cause

The consistent reproduction of these failures indicates this is likely a genuine regression caused by the task scheduler refactoring, not test flakiness.

The PR refactors task schedulers in common/task/ to use context.Context instead of channels:

  • Changed files: fifo_task_scheduler.go, weighted_round_robin_task_scheduler.go
  • Changes: Replaced shutdownCh chan struct{} with ctx context.Context and cancel context.CancelFunc
  • Behavior change: Using ctx.Done() instead of <-shutdownCh

These task schedulers are used by the history service task processor (service/history/task/processor.go), which handles activity timer tasks. The timing difference between context cancellation and channel closure may be affecting:

  • Activity heartbeat timeout detection
  • Task processing timing in transfer/timer queues
  • Activity retry scheduling

Details

Why context.Context might behave differently:

  1. Context cancellation propagation: When cancel() is called, it marks the context as done and closes an internal channel. This may have microscopically different timing characteristics than directly closing a channel.

  2. Task queue processing: The weighted round robin task scheduler is used in history service task processing. Any timing difference in shutdown/cancellation could affect:

    • When tasks are drained from queues
    • When activity timeout tasks are processed
    • The order of task processing during concurrent operations
  3. Activity heartbeat behavior: The tests specifically verify timing-sensitive behavior:

    • TestActivityHeartbeatTimeouts: Expects activities to timeout after heartbeat intervals expire
    • TestActivityHeartbeatDetailsDuringRetry: Expects specific retry counts based on heartbeat failure detection

If task processing is even slightly delayed or accelerated, it could cause:

  • Activities to complete before timeout detection triggers (Test 1)
  • Retry logic to execute fewer times than expected (Test 2)
Code Review ✅ Approved

Clean mechanical refactoring replacing shutdownCh with context.Context. All changes are semantically equivalent, idiomatic Go, and well-tested.

Rules ❌ No requirements met

Repository Rules

PR Description Quality Standards: Expand Why section to explain previous shutdownCh implementation and its limitations; add -v flag to test command

2 rules not applicable. Show all rules by commenting gitar display:verbose.

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@Shaddoll Shaddoll enabled auto-merge (squash) February 25, 2026 21:18
@Shaddoll Shaddoll merged commit d52ae40 into cadence-workflow:master Feb 25, 2026
75 of 77 checks passed
@Shaddoll Shaddoll deleted the scheduler branch February 25, 2026 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants